2. Project Scope

The objective of this proposed research effort is to conduct a feasibility study of in-vehicle
Group Activities (iVGA) and to develop suitable context-based taxonomy and ontology
schemas for coherent analysis and inferencing of such activities. The prime objective of the
feasibility study is to identify suitable automated computer vision-based techniques that
enable identification of iVGA with a low false alarm rate. Another objective is to develop a
suitable virtual simulation environment as a test bed for staging appropriate context-based
scenarios exhibiting different iVGA, for testing and evaluation of newly developed
ontology, algorithms, and techniques. The goal is to map visual observations of iVGA to
annotated messages describing spatiotemporal behaviors exhibited in the confined space
of a vehicle despite physical obstructions and inevitable occlusions. In this project,
partially observable spatiotemporal imagery data are analyzed to reason about, predict, and
discriminate group activities occurring in confined spaces.


3.  Introduction 

Asymmetric adversarial group activities are generally embedded in the extensive "clutter"
of urban environments and may involve well-trained individuals exploiting different tactics
to execute tasks pertaining to their mission while organizing themselves in a way that
obscures the detection of their cover, access, sources, and responsibilities. The ability to
identify the adversarial intent of group activities based on analysis of their physical and
interaction behavioral patterns would significantly improve asymmetric counter-insurgency
and peace-keeping operations. At present, the adversarial intent of group activities is
judged by soldiers observing such activities, which often entails significant danger and
possibly high false-positive and false-negative rates [1]. The ability to automatically
identify adversarial intent via fusion of data from multi-source sensors would offer an
opportunity to reduce false-alarm rates and facilitate discovery of potentially perilous
group activities while helping to shift the balance in peace-keeping operations.

4.  Project  Motivation 

The proposed project was motivated by the realization that numerous group activities may
take place under circumstances that limit full visibility of their operations. Since many
such group activities are monitored by surveillance cameras, there is a great need for
robust techniques that automatically analyze and interpret the imagery data associated
with these activities and generate pertinent information about the nature of their
operations.

For instance, such group activities might occur inside or in close proximity to a vehicle,
boat, or airplane, at the corner of an alley in an urban environment, at a security bunker,
behind large obstructions in an urban environment, or at a border location where only
partial visibility of the group activities is available. An exemplary case is an in-vehicle
group activity carried out by a team of insurgents. The behavioral patterns of insurgents in
such a situation can be divided into multiple phases: an "arrival phase", where insurgents
approach the vehicle from different sides either simultaneously or at different time
intervals; an "in-vehicle phase", where some interactions among the participating
insurgents take place inside the vehicle; and a "departure phase", where insurgents depart
the vehicle in a certain pattern. By spatiotemporal inferencing and reasoning over such
observations, the adversarial intent of such an insurgency group activity can potentially be
identified. In this project, we are interested in developing robust imagery techniques that
facilitate better tracking and better understanding of In-Vehicle Group Activities (iVGA)
under partial visibility constraints.


5.  Scope  of  This  Feasibility  Study 

This project has two phases. This report covers our research progress in the first phase of
the project, called the feasibility study. The second phase is outside the scope of this
report and is therefore not discussed.

This  feasibility  study  is  further  divided  into  two  parts,  a  theoretical  feasibility  study  and  an 
experimental  feasibility  study.  The  theoretical  study  will  begin  with  development  of 
context-based  taxonomy  and  ontology  schema  for  coherent  analysis  and  inferencing  of 
the  iVGA.  One  other  objective  of  this  theoretical  study  is  to  develop  suitable  adaptive 
image  processing  techniques  for  characterization  of  iVGA. 

By taxonomy, we refer to the systematic classification of iVGA. By ontology, we mean a
formal representation of knowledge as a collection of concepts within a taxonomy domain
and the relationships among those concepts. The ontology-based concepts can be used to
reason about the entities involved and their relationships within the specified taxonomy
domain. In the study of iVGA, learning to associate and correlate sub-group activities is
the key to differentiating normal behavioral patterns from abnormal ones.

The theoretical effort of this project develops ontological models intended to capture
pattern templates useful for detection, recognition, and discovery of potential iVGA
adversarial intent. Another aspect of this preliminary feasibility study is to establish an
iVGA ontology model with which an automatic surveillance system can properly relate its
observational information discovery (i.e., evidential cues) to recognition of iVGA that may
involve adversarial intent.


6.  Research  Accomplishments 

The research accomplishments of this project are manifold. The following is a summary of
our research accomplishments:

6.1 Development of Space-state Model for Occupants Inside a Vehicle - The space-state
model allows recognition of the normal zones in the physical environment where
different body parts of a passenger, e.g., head, arms, and hands, are normally
expected to be present. Figure 1 presents the space-state zones corresponding to
the normal spaces where a driver's head and forearm are expected during normal
operation inside a vehicle. Figure 2 illustrates the designated spaces where the
hands will be under normal operations inside a vehicle. Figure 3 shows the
designated zones where the head and hands will be expected during certain
operations inside the vehicle. Figure 4 demonstrates the designated spaces for
cameras and video camcorders, which are determined by kinematically sensible
(i.e., comfortable) configurations for taking pictures and recording video from
inside a vehicle. These space states facilitate inferring and discriminating the
normal postural configurations of a passenger inside a vehicle from deviant
postural configurations that may raise a security concern.
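
As an illustration only, the zone idea can be sketched as a lookup from an observed hand
position to the space-state zones that contain it. The zone acronyms below are taken from
Figure 2; the rectangle coordinates, the normalized window coordinate system (x to the right,
y downward, both in 0..1), and the names Zone, HAND_ZONES, and zones_for are assumptions made
for this sketch and are not the geometry used in the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Zone:
    """Axis-aligned zone in normalized window coordinates (0..1, y grows downward)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


# Hypothetical rectangles; only the acronyms come from Figure 2.
HAND_ZONES = [
    Zone("HCWZ", 0.0, 0.0, 1.0, 0.2),  # Hand(s) Ceiling Working Zone
    Zone("HHCZ", 0.3, 0.2, 0.7, 0.5),  # Hand(s) Holding Comm Device Zone
    Zone("HSWZ", 0.5, 0.5, 1.0, 0.8),  # Hand(s) on the Steering Wheel Zone
    Zone("HLWZ", 0.0, 0.8, 1.0, 1.0),  # Hand(s) Lower Working Zone
]


def zones_for(x: float, y: float, zones: List[Zone] = HAND_ZONES) -> List[str]:
    """Return the names of all zones containing the point (x, y)."""
    return [zone.name for zone in zones if zone.contains(x, y)]
```

In this sketch, a hand position that falls in no zone returns an empty list, which is the
kind of observation the space-state model would pass on for further scrutiny.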


[Figure 1 graphic: Space-State Zoning of Head, Upper-Arm, and Forearm. Zone labels: Head
Movability Zone (HMZ); X-Back Shoulder Reach Zone (XBRZ); X-Upper-Arm Movability Zone
(XUMZ); X-Forearm Movability Object-Exchange Zone (XMOZ); X-Forearm Movability Zone
(XFMZ), where X = Left or Right. Annotations: "Divides window 60/40 proportionally";
"Drawn parallel to window edge"; "Translates forward with forward movement of the head";
"Region of passing an object to backseat passengers".]


Figure 1. Space-state Zoning of Driver Inside a Vehicle.

[Figure 2 graphic: Space-State Zoning of Hands. Zone labels: Hand(s) on the Steering Wheel
Zone (HSWZ); Hand(s) Lower Working Zone (HLWZ); Hand(s) Ceiling Working Zone (HCWZ);
Hand(s) Holding Comm Device Zone (HHCZ); Hand(s) Tampering With Vehicle Zone (HTVZ).
Annotations: "Divides window 60/40 proportionally"; "Drawn parallel to window edge".]


Figure 2. Space State of Driver Hand Inside a Vehicle.
[Figure 3 graphic: Space-State Zoning of Head and Forearm (Cont.). Zone labels: X-Hand
Adjusting Side Mirror Zone (XHAZ); X-Arm Object-Dropping Zone (XODZ), where X = Left or
Right and Y = Right or Left, opposite to X. Annotation: "Drawn parallel to window edge".]


Figure 3. Space-state Zoning of Driver Head and Forearm Inside a Vehicle.


[Figure 4 graphic: Space-State Zones for Special Applications. Zone label: Hands Taking
Video Zone (HTVZ). Head and shoulder annotations:

•  Shoulder does not move, but head turns 90°: talking to the next person

•  Shoulder turns and head turns 90°: delivering an object

•  Shoulder turns and head turns 180°: talking to backseat passengers

•  Head turns -45°: talking to people outside of the vehicle]

Figure 4. Space-state Zoning of Driver for Special Applications, e.g., taking a picture or
recording video of a scene.


6.2 Development of Ontology for Inferring Occupants' Postural Configurations
Inside Vehicle - The newly developed ontologies allow specific postural
configurations of a passenger inside the vehicle to be associated with certain normal
operations inside the vehicle, e.g., adjusting the rear-view mirror, turning the steering
wheel, resting an arm against the edge of the door window, or extending an arm through
the window opening to drop something outside. Figure 5 illustrates the kinematic division
of an arm into three distinct parts: Upper Arm, Forearm, and Hand. As demonstrated,
each arm segment at any given kinematic configuration may be described via a tri-state
notation: Up, Mid, or Down. This implies that one can use nine segment-state values
(three segments, three states each) to describe the kinematic postural configuration of
an arm. Figure 6 illustrates the kinematic configurations of the upper arm relative to the
physical world (i.e., the vehicle). Figure 7 presents the taxonomy of iVGA. The upper
graph in Figure 7 presents the tri-state of the vehicle: parked (i.e., stationary), arriving,
or departing (i.e., leaving). Under each state, the vehicle is expected to be either
occupied or not occupied. If the vehicle is found occupied, the occupants are considered
either driver or passengers. The driver is always found inside, in the front of the vehicle,
whereas the other passengers may be found sitting in the front of the vehicle next to the
driver, or in the back seat right behind the driver and front passenger. Figure 8
illustrates the taxonomy of the whereabouts of the driver and the passenger(s) if found
outside the vehicle. Figure 9 presents an ontology chart of an occupied vehicle with
driver and passenger(s) when their heads and arms are either visible or invisible. Note
that when either the head or an arm is invisible, no specific deduction is made. When the
arms are visible, they may be either moving or stationary (i.e., not moving), and each of
these states implies its own inferences. Figure 10 summarizes the ontology of a driver's
head in the visible state inside a vehicle. This figure illustrates the inferences drawn
from the driver's head while turning, looking down, looking up, and looking straight
ahead.
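
The tri-state notation can be sketched as a small data structure. The following is a
minimal illustration, assuming the third arm segment is the hand (as in the later hand
ontologies); the names TriState, ArmPosture, and ALL_POSTURES are hypothetical and are not
taken from the project's software.

```python
from enum import Enum
from itertools import product
from typing import NamedTuple


class TriState(Enum):
    UP = "U"
    MID = "M"
    DOWN = "D"


class ArmPosture(NamedTuple):
    """One tri-state value per arm segment (segment names are assumptions)."""
    upper_arm: TriState
    forearm: TriState
    hand: TriState

    def code(self) -> str:
        # Compact notation, e.g. "UUM" for upper arm up, forearm up, hand mid.
        return "".join(state.value for state in self)


# Every combination of three segments with three states each.
ALL_POSTURES = [ArmPosture(*combo) for combo in product(TriState, repeat=3)]
```

Enumerating the combinations this way yields 27 candidate postures for one arm; not all of
them need be kinematically achievable.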

[Figure 5 graphic: Arm Sections and Directions. The arm is broken into three sections.]


Figure 5. Three-segmented regions of an arm (left illustration) and tri-state of arm
kinematic configuration (right illustration).
[Figure 6 graphic: Upper Arm Orientation.]


Figure 6. Three sections of the driver's side window: upper part, middle part, and lower part.
Note that the dividing lines are adjustable parameters that can be set manually by the user or
automatically by the image processing algorithms attempting to interpret an iVGA scenario.


Figure 7. Taxonomy of iVGA. The upper figure shows the tri-state of the vehicle: parked (i.e.,
stationary), arriving, or departing (i.e., leaving). Under each state, the vehicle is expected
to be either occupied or not occupied. If the vehicle is found occupied, the occupants are
considered either driver or passengers. The driver is always found inside, in the front of the
vehicle, whereas the other passengers may be found sitting in the front of the vehicle next
to the driver, or in the back seat right behind the driver and front passenger.

Figure 9. Ontology of an occupied vehicle with driver and passenger(s) where their heads and
arms are either visible or invisible. Note that when either the head or an arm is invisible, no
specific deduction is made. When the arms are visible, they may be either moving or
stationary (i.e., not moving), and each of these states implies its own inferences.

[Figure 10 graphic: recoverable node labels include "Turned Outside" and "Watching Outside".]


Figure 10. Ontology of a driver's head in the visible state and the implications of the driver's
head turning, looking down, looking up, and looking straight ahead. When the head is turned,
it may be turned inside, outside, or "reversed". The latter refers to a situation where the
driver may be talking to a passenger located in the back seat. The deductions from the head
configuration in the down and up states are plausible inferences that may require further
interrogation.


Figure 11 presents the ontology of the driver's moving arm in two states: (1) free state
and (2) holding state. In the free state, the moving arm may be pointing upward or
downward. The implication of either state is modeled in the left leaf of the chart in
Figure 11. In the holding state, the object may be held by either one hand or two hands.
Each state, again, has its own implication, as illustrated by the right leaf of the chart in
Figure 11. For instance, phoning involves movement of an arm while holding an object
(e.g., a phone) and positioning it somewhere around the head. Once this composition of
arm movements is observed, it can be concluded that a phone call is being placed. On the
other hand, when an object is held by two hands, the hands may be found below the chest
area, around the head area, or above the head area. Each situation has its own
implication, as captured by the right leaf of the chart in Figure 11. For instance, a
situation where both of the driver's hands are found moving toward his/her face area may
indicate that the driver is adjusting his/her eyeglasses, adjusting his/her makeup, or
possibly attempting to hide his/her face to avoid being recognized. In a normal situation,
it is not customary for a person to use two hands around his/her face area. This does not
mean it is impossible, but rather that it is out of the ordinary. Moreover, when the hands
are found above the face and over the head, one likely explanation is that the individual
is fixing his/her hair or adjusting his/her hair cover. Note that the ontology described
here can be decisive in explaining the normality of the postural states of an individual
inside a vehicle. Many exceptions can occur that are not deemed significant enough to be
included in the ontology charts. In this project, we aim to model normal situations, and
the assumption is that anything outside this normality regime may be anomalous and
subject to further examination and scrutiny through this analysis. Figure 12 illustrates
the ontology of the situation where the driver's arms are found visible but not moving.
Figure 13 presents the notations developed for describing the postural states of arms.
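
The holding-state reasoning above can be sketched as a lookup from an observation to its
candidate explanations. This is a rough illustration, not the chart of Figure 11: only the
explanations named in the text are included, and the region names, the CANDIDATES table,
and both function names are assumptions.

```python
from typing import Dict, List, Tuple

# Candidate explanations keyed by (number of hands involved, hand region).
# Region names are illustrative; the explanations are those named in the text.
CANDIDATES: Dict[Tuple[int, str], List[str]] = {
    (1, "around_head"): ["phoning"],
    (2, "around_head"): ["adjusting eyeglasses", "adjusting makeup", "hiding face"],
    (2, "above_head"): ["fixing hair", "adjusting hair cover"],
}


def explain_moving_arm(hands: int, region: str) -> List[str]:
    """Return candidate explanations; an empty list means this sketch has no entry."""
    return CANDIDATES.get((hands, region), [])


def needs_scrutiny(hands: int, region: str) -> bool:
    """Flag observations with no modeled explanation or with a concealment candidate."""
    options = explain_moving_arm(hands, region)
    return not options or "hiding face" in options
```

Because the table is deliberately incomplete, an observation such as two hands below the
chest is flagged here only because this sketch does not model it, not because the ontology
treats it as abnormal.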


[Figure 11 graphic: recoverable node labels include "Backseat Passenger" and "Hiding Face".]


Figure  11.  Ontology  for  state  of  the  driver  arm  moving. 


[Figure 12 graphic: Arms Not Moving (Partially Visible). Recoverable node labels: Down at
Side; On Steering Wheel; On Window Edge; On Side; Against; Chair; Head.]


Figure  12.  Ontology  of  a  driver  arm  detected  but  found  not  moving. 

[Figure 13 graphic: Observations. Labels: Left Arm; Head; Right Arm. State notations:
Up (U), Down (D), Straight (S).]


Figure  13.  Ontology  notations  for  describing  the  postural  states  of  arms. 


Figure 14 presents the arm gesture temporal model. A typical arm movement can be
described by three temporal states: Resting (R), Transitioning (T), and Stroking (S). Each
state is associated with a time period. Arms tend to come to rest after each movement,
which conserves energy and relieves the extra load on the biological system. Arms in an
extended configuration typically require more energy for their control, which makes this
a tiring kinematic configuration for a human being. Arms at rest, however, consume less
energy and place no additional burden on the brain to monitor their control. A more
elaborate arm movement may involve a combination of transitions, namely moving from
one state to another in order to accomplish some action, after which the arm typically
comes to rest via a rapid stroke that eases the load on the muscles controlling the arm in
the extended kinematic configurations. Figure 15 presents the notations for describing
arm and hand postural/gestural configurations.

[Figure 14 graphic: Arm Gestures Temporal Modeling.]

Arm-gesture temporal states and time durations (ut = unit time):

    State                 Time Duration
    Resting (R)           R > 1 ut
    Transitioning (T)     T > 2 ut
    Stroking (S)          S = 1 ut

Figure 14. Arm Gesture Temporal Modeling via Three States: Resting (R),
Transitioning (T), and Stroking (S).
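
The duration bounds of Figure 14 (R > 1 ut, T > 2 ut, S = 1 ut) can be turned into a simple
plausibility check on a labeled arm-state sequence. The sketch below assumes one state
label per unit time; the function names and the string encoding are hypothetical and are
not part of the reported system.

```python
from typing import List, Tuple


def duration_ok(state: str, duration: int) -> bool:
    """Check one (state, duration) pair against the Figure 14 bounds, in unit times."""
    if state == "R":
        return duration > 1
    if state == "T":
        return duration > 2
    if state == "S":
        return duration == 1
    raise ValueError(f"unknown state: {state!r}")


def segment(labels: str) -> List[Tuple[str, int]]:
    """Run-length encode a per-unit-time label string such as 'RRTTTS'."""
    runs: List[Tuple[str, int]] = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1] + 1)
        else:
            runs.append((label, 1))
    return runs


def plausible_gesture(labels: str) -> bool:
    """True if every run of identical states satisfies its duration bound."""
    return all(duration_ok(state, length) for state, length in segment(labels))
```

For example, the sequence "RRTTTSRR" passes the check, while "RTTS" fails because both the
resting and the transitioning runs are too short.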

Temporal States of Right Arm


Figure  15.  Ontology  for  notations  for  arms  and  hands  postural/gestural  configuration. 


Figure 16 presents the ontology of the driver's right upper arm in three states: up, middle,
and down. A typical person uses his/her arms to perform specific functions while seated in
the driver's seat. For example, adjusting the rear-view mirror is much more likely to be done
with the right arm than with the left arm. When an upper arm is observed and confirmed to
be in the up, middle, or down kinematic configuration, the possible functions that the person
may be performing are captured in the state model of the ontology chart in Figure 16. This
ontology model excludes operations that are not kinematically feasible or are unlikely to be
performed with the upper arm while it is in any other, unspecified state. Figure 17 presents
the ontology developed to denote the forearm postural/gestural configurations. Figure 18
presents the ontology of the driver's right forearm in three states: up, middle, and down.
Similar to the description provided for Figure 16, Figure 18 captures the essence of the
forearm kinematic configuration and its implication when detected in any of those known
states. For example, this ontology reveals that for adjusting the rear-view mirror, the
forearm must be seen in an upward configuration. This also supports the previous ontology
for the right upper arm configuration required for the same function. Namely, for adjusting
the rear-view mirror, both the right upper arm and the right forearm must be detected in
the upward configuration; any violation of this requirement rules out the perception of this
operation. Moreover, when the right upper arm and the forearm are found in the upward
configuration and directed toward the upper middle of the vehicle toward the front, this
supports the perception of reaching for the rear-view mirror and attempting to adjust it.
This is a knowledge-based postulation that may require further reinforcement via
supplementary examination and scrutiny of this operation.
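
The rear-view mirror rule described above can be written as a small predicate. This is a
hedged sketch: the state strings, the three-valued result, and the function name are
assumptions made for illustration, and the rule covers only this single operation.

```python
def mirror_adjustment_hypothesis(right_upper_arm: str,
                                 right_forearm: str,
                                 toward_front_center: bool = False) -> str:
    """Return 'ruled_out', 'possible', or 'supported' for the mirror-adjustment hypothesis.

    Both right-arm segments must be in the 'up' state; pointing toward the upper
    front center of the vehicle strengthens the hypothesis.
    """
    if right_upper_arm != "up" or right_forearm != "up":
        return "ruled_out"
    return "supported" if toward_front_center else "possible"
```

Even a "supported" result remains a postulation that would need confirmation from other
cues.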


[Figure 16 graphic: Ontology of Right Upper-Arm Configurations, An Example. Recoverable
task labels: Eating; Adjust HVAC; Phoning; Working; Delivery; Turning Steering Wheel. Note
in figure: a blue outline means a unique task for that arm.]



Figure  16.  Ontology  of  Right  Upper-Arm  in  three  states  of  Up,  Middle,  and  Down. 


Temporal  States  of  Right 
Forearm 


Figure  17.  Ontology  for  notations  for  Forearm  postural/gestural  configuration. 

[Figure 18 graphic: Ontology of Right Forearm Temporal Gestures. Note in figure: a blue
outline means a unique task for that arm.]


Figure 18. Ontology of Right Forearm in three states of Up, Middle, and Down.


Figure 19 presents the ontology of the hand. Figure 20 presents the ontology of the temporal
configurations that the hand can assume in its open or closed gestural configuration. Human
hands are very dexterous and are used to perform a tremendous number of functions,
particularly manipulating objects. Inside the confined space of a vehicle, the right hand
and left hand are unlikely to duplicate each other's normal functions as far as the basic
operation of the vehicle is concerned. For instance, turning the interior lights of the
vehicle on or off is much more likely to be performed by the right hand than by the left
hand, because of its spatial closeness to the ceiling lights, which are always centrally
situated inside the vehicle. Such an operation is not a matter of preference but, again, a
matter of energy conservation, which is an inherent attribute of biological systems. On the
other hand, many object manipulation actions may be handled by either hand. Under this
ontology model, the manipulative configurations of the hand are collectively considered a
"working" configuration with no further refinement, because the kinematic complexity of
hands makes identification of their configurations ambiguous.

Similarly, for the left arm, we developed the complementary ontologies for the left upper
arm, left forearm, and left hand. These ontologies are shown in Figures 21 through 26 and
are self-explanatory.

Temporal Space-States of
Right-Hand


Figure 19. Ontology for notations for hand gestural configuration.

Temporal Gestures

Figure 20. Ontology of Right Hand in two states, open and closed.

Temporal States of Left
Upper-Arm 


Figure  21.  Ontology  of  left  upper  arm. 

Ontology  of  Left  Upper-Arm 
Temporal  Gestures 


Figure  22.  Ontology  of  left  upper  arm  for  three  states  of  Up,  Middle,  and  Down. 

Temporal States of Left
Forearm 


Ontology  of  Left  Forearm 
Temporal  Gestures 


Figure  24.  Ontology  of  Left  Forearm  for  three  states  of  Up,  Middle,  and  Down. 




Temporal  States  of  Left  Hand 


Figure  25.  Ontology  of  Left  Hand. 

Ontology  of  Left  Hand 

Figure 26. Ontology of left hand for two states, open and closed.


6.3  Development  of  Arms  Kinematic  Configuration  Admissibility  (AKCA)  Decision  Trees 

The newly developed ontologies for the left and right arms consider only arm postural
configurations that are kinematically achievable. However, there are many combinations
of arm kinematic configurations that are not achievable. We call this category of
kinematically unachievable arm postural configurations "kinematically inadmissible
arm postural configurations". These inadmissible arm postural configurations are
traceable, and we managed to capture these singularities in an AKCA decision tree, as
shown in Figure 27.

To arrive at the AKCA decision tree, we employ a technique called ID3. In decision tree
learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan [1]
that is used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5
algorithm and is typically used in the machine learning and natural language
processing domains.

The ID3 algorithm begins with the original set S as the root node. On each iteration,
the algorithm iterates through every unused attribute of the set S and calculates the
entropy H(S) (or information gain IG(S)) of that attribute. It then selects the attribute
that has the smallest entropy (or largest information gain) value. The set is then
split by the selected attribute (e.g., age < 50, 50 <= age < 100, age >= 100) to produce
subsets of the data. The algorithm continues to recurse on each subset, considering
only attributes never selected before. Recursion on a subset may stop in one of these
cases:

•  Every element in the subset belongs to the same class (+ or -); the node is then turned
into a leaf and labeled with the class of the examples.

•  There are no more attributes to be selected, but the examples still do not belong to
the same class (some are + and some are -); the node is then turned into a leaf and
labeled with the most common class of the examples in the subset.

•  There are no examples in the subset. This happens when no example in the parent set
matches a specific value of the selected attribute, for example, if there was no
example with age >= 100. A leaf is then created and labeled with the most common
class of the examples in the parent set.
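
To make the attribute-selection step concrete, the following sketch computes the standard
Shannon entropy, H(S) = -sum over classes of p log2 p, and the information gain of an
attribute. The function names and the dictionary-based row format are assumptions made for
this illustration; any example rows passed to these functions here are made up and are not
project data.

```python
from collections import Counter
from math import log2
from typing import Dict, List


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows: List[Dict[str, str]], labels: List[str], attribute: str) -> float:
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```

An attribute that separates the classes perfectly has an information gain equal to the
entropy of the unsplit set, while an attribute that leaves every subset as mixed as the
original has zero gain.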

Details of the ID3 algorithm are summarized in Figure 28. Throughout the algorithm, the
decision tree is constructed with each non-terminal node representing the selected
attribute on which the data was split, and terminal nodes representing the class label of
the final subset of that branch. The AKCA decision trees can serve as a lookup table and
facilitate verification and validation of a kinematic configuration of the arms; namely,
they help in accepting or rejecting a kinematic arm configuration suggested by the image
processing algorithms that detect a passenger's arm kinematic configuration. Note that
certain arm configurations are kinematically inadmissible; e.g., the hand is dexterous,
but its degrees of freedom become limited under certain kinematic configurations of its
arm. Figure 29 illustrates another decision tree, generated using the ID3 algorithm, that
summarizes the logic for determining which hand gestural configurations are admissible.

Figure 27. A generalized decision-tree of kinematically admissible arm postural configurations.


ID3(D, Target, Attrs)

returns: a decision tree that correctly classifies the given examples
variables:
    D: Training set of examples
    Target: Attribute whose value is to be predicted by the tree
    Attrs: List of other attributes that may be tested by the learned decision tree

create a Root node for the tree
if D are all positive then Root ← +
else if D are all negative then Root ← -
else if Attrs = ∅ then Root ← most common value of Target in D
else
    A ← the best decision attribute from Attrs
    Root ← A
    for each possible value vi of A
        add a new tree branch with A = vi
        Dvi ← subset of D that have value vi for A
        if Dvi = ∅ then add leaf ← most common value of Target in D
        else add the subtree ID3(Dvi, Target, Attrs - {A})

Figure 28. Pseudo-code description of ID3 decision-tree algorithm.
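
The pseudo-code of Figure 28 can be sketched in runnable form as follows. This is a minimal
illustration under stated assumptions: examples are dictionaries, the tree is a nested
dictionary, `domains` lists the possible values of each attribute, and the best attribute is
the one with the smallest weighted entropy after the split. The names id3 and classify are
hypothetical, and this is not the code used to produce the AKCA trees.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def id3(rows, labels, attrs, domains):
    """Build a decision tree as nested dicts; leaves are class labels.

    rows: list of dicts (attribute -> value); labels: class label per row;
    attrs: attributes still available; domains: attribute -> possible values.
    """
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when all examples share one class or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority

    def remainder(attr):
        # Weighted entropy of the subsets produced by splitting on attr.
        total = 0.0
        for value in domains[attr]:
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            if subset:
                total += len(subset) / len(labels) * entropy(subset)
        return total

    best = min(attrs, key=remainder)  # smallest entropy = largest gain
    rest = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in domains[best]:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        if not pairs:
            # No examples for this value: leaf with the parent's majority class.
            tree[best][value] = majority
        else:
            sub_rows, sub_labels = zip(*pairs)
            tree[best][value] = id3(list(sub_rows), list(sub_labels), rest, domains)
    return tree


def classify(tree, row):
    """Walk the tree until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree
```

A tree built this way can then be used as the kind of lookup described above: a candidate
arm configuration reported by the image processing stage is passed to `classify`, and the
returned label indicates whether the training examples treated it as admissible.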

Figure 29. A generalized decision-tree of kinematically admissible
hand-down gestural configurations.

6.4 Development of New Algorithms for Vehicle Windows Detection - In this feasibility
study, we considered situations in which the vehicle under surveillance is immobile and
activities are taking place inside the vehicle. Given a perspective view of the vehicle,
the objective is to determine the sections of the vehicle corresponding to windows
that allow seeing into the vehicle. Many varieties of vehicles are available. In this
project, we consider a few vehicles, as illustrated in the following figures. Windows
are an integral part of a vehicle, and the upper portion of most commercially available
vehicles is surrounded by windows of different shapes and transparency. The
camera observing the vehicle is assumed to be located above the height of the driver
inside the vehicle; the camera therefore looks forward at a slope through the
window to observe the activities of the vehicle occupants. Primarily, we considered the
driver, since each vehicle is driven by at least one driver. Other vehicle occupants are
optional and arbitrary. Therefore, first and foremost, we are interested in
characterizing the activity of the driver through a window orientation that reveals
his/her actions. Figure 30 presents the results of several vehicle edge detection Image
Processing (IP) algorithms developed for detection of lines representing the structural
configuration of the vehicle. A typical commercial vehicle has many surface features
and contains many reflective regions because of the curvature of its body surfaces,
parts, and corners. A robust edge-detector algorithm must isolate the strongest edges
representing the geometrical shape of the vehicle without compromising the structural
integrity of the vehicle outline or producing uncharacteristic edge formations. Figures
31 and 32 demonstrate different edge detector algorithms developed and tested for this
purpose. Figures 33 through 35 demonstrate the next step of this process, detection of
the front window and side windows of the vehicle.




Figure  30.  Result  of  first  image  processing  technique  for  vehicle  edge  detection. 


Figure  31.  Results  of  another  image  processing  technique  for  vehicle  edge  detection. 


Figure  32.  Results  of  third  image  processing  technique  for  vehicle  edge  detection. 
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Figure  34.  Results  of  fifth  image  processing  technique  for  detection  of  vehicle  side  window. 


Figure  35.  Results  of  sixth  image  processing  technique  for  detection  of 
vehicle  side  window  from  original  vehicle  binary  image. 
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6.5  Development  of  New  Algorithms  for  Human  Skin  Detection  -  Upon  detection  of 
the vehicle window area, the next step is to isolate the head and the left and right arms. The
common denominator of these three body parts is skin. Therefore, we set out
to develop a robust skin detector based on the YCrCb color space. Figure 36 illustrates
the regions of the Y'-Cb and Y'-Cr charts selected for segmentation of
skin colors. These two charts were developed heuristically by examining
hundreds of skin samples from a variety of ethnic groups with different skin
colors. Any point in YCrCb space that projects into the red regions of the charts in
Figure 36 is recognized by our algorithm as a skin point. Figure 37 shows
a picture from the Wikipedia website that was used to test the newly developed skin
detector. Prior to applying this filter, noise in the original image needs to be
suppressed. We tried different digital filters for this purpose, including Median,
Bilateral, and Circular filters, which we found to be the most effective for suppressing
RGB color noise and producing more consistent and smoother skin colors. Figure 38
illustrates the effectiveness of our newly developed skin detector for a variety of
people with different ethnic backgrounds and skin colors under partial shading.
Selection of the noise suppression filter is critical, since excessive filtering
also suppresses many facial features that appear relatively small in the
pictures. Figure 39 demonstrates the strength of this skin color detector for people
of the same ethnicity but with different sizes, shading, and partial obstructions and
occlusions. Figure 40 shows the effectiveness of this approach for a face that is
partially darkened by shadow.
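The projection test described above can be sketched as a per-pixel check in Cr-Cb space. This is only an illustration under stated assumptions: the rectangular Cr and Cb limits below are commonly quoted textbook values and are not the heuristic regions of Figure 36 (which also depend on Y'), and the function names are hypothetical. The conversion uses the full-range ITU-R BT.601 formulas.

```python
# Hypothetical thresholds (a commonly quoted Cr/Cb skin box); NOT the regions of Figure 36.
CR_MIN, CR_MAX = 133.0, 173.0
CB_MIN, CB_MAX = 77.0, 127.0

def rgb_to_ycrcb(r, g, b):
    """Full-range BT.601 conversion of one 8-bit RGB pixel to (Y, Cr, Cb)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 128.0 + 0.713 * (r - y)
    cb = 128.0 + 0.564 * (b - y)
    return y, cr, cb

def is_skin(r, g, b):
    """True if the pixel's chroma falls inside the assumed skin box (luma is ignored here)."""
    _, cr, cb = rgb_to_ycrcb(r, g, b)
    return CR_MIN <= cr <= CR_MAX and CB_MIN <= cb <= CB_MAX

def skin_mask(image):
    """image: list of rows of (r, g, b) tuples -> list of rows of booleans."""
    return [[is_skin(*pixel) for pixel in row] for row in image]
```

In practice the image would first be smoothed with one of the noise filters mentioned above, and the box test would be replaced by a lookup into the heuristically derived chart regions.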
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Figure  36.  Charts  of  Y-Cr  and  Y-Cb  spaces  for  skin  color  detection. 




Figure 37. Results of skin detector: Original (Upper Left), Large Median Filter+Skin Detector (Upper
Right), Bilateral Filter+Skin Detector (Lower Left), Circular Filter+Skin Detector (Lower Right).
Courtesy of Wikipedia.


Figure  38.  Results  of  different  Samples  from  variety  of  different  ethnicity  groups. 
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Figure  40.  Results  of  skin  detection  techniques  for  a  bi-modal  image  -  partly  lighted  up,  partly 
shadowed. Note that the hat is detected as part of the skin.


6.6 Development of a New Algorithm for Isolation of Human Body Parts from Dynamic
Videos - The skin detector was also tested on several videos downloaded from YouTube
that show unstructured activities inside vehicles. Figures 41 and 42 illustrate
the robustness of this newly developed skin detector for real-time video processing
applications. The computational efficiency of this skin detector is high, and
it can be readily applied to real-time video processing. For these two
videos, we were able to process 25 frames per second at an image size of
320x240 pixels. The videos shown in Figures 41 and 42 contain
significant shadowing effects, and the activities are rather dynamic. The video in
Figure 42 shows only the arm and hand movements involved in assembling a holder
that is mounted inside the vehicle to support an iPad.
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Figure  42.  Results  of  Another  Example  of  Skin  detectors  based  on  a  YouTube  Video  Demonstrating 

An  in  Vehicle  Activity. 


6.7 Development of a New Algorithm for Tracking Human Body Parts - Upon
detection of the skin-colored body parts of an occupant, the next step is to isolate each body
part and identify its kinematic configuration, namely, its orientation. We use
straightforward logic to isolate these parts. The head corresponds to the blob that
satisfies a certain aspect ratio and is most likely located above the other body parts. The left and
right arms are differentiated by associating nearby blobs to form the most likely
representation of the left and right arms. If the camera is observing through the driver side,
the left arm is taken to be the blob that is below the head, meets a certain aspect ratio,
and is the larger of the major blobs below the head. Following this logic,
the second major blob located below the head is automatically considered to be the right
arm. Figures 44-49 show the results of tracking the body parts in different
YouTube iVGA videos. Figures 50 and 51 present other YouTube iVGA videos
processed for the purpose of testing and training the iVGA image processing algorithms.
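The blob-labeling logic described above can be sketched as follows, assuming each skin blob has already been reduced to a bounding box and a pixel area. The `Blob` type, the aspect-ratio limits, and the elongation threshold are hypothetical; the report does not give its actual thresholds.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Blob:
    x: int      # left edge of bounding box (pixels)
    y: int      # top edge of bounding box (image y grows downward)
    w: int      # width, assumed > 0
    h: int      # height, assumed > 0
    area: int   # number of skin pixels in the blob

# Hypothetical limits, chosen only for illustration.
HEAD_ASPECT = (0.6, 1.4)        # allowed width / height for a head-like blob
ARM_MIN_ELONGATION = 1.5        # long side / short side for an arm-like blob

def label_parts(blobs):
    """Label head, left arm, and right arm for a camera looking through the driver side."""
    heads = [b for b in blobs if HEAD_ASPECT[0] <= b.w / b.h <= HEAD_ASPECT[1]]
    if not heads:
        return {}
    head = min(heads, key=lambda b: b.y)            # topmost head-shaped blob
    arms = [b for b in blobs
            if b is not head and b.y > head.y
            and max(b.w, b.h) / min(b.w, b.h) >= ARM_MIN_ELONGATION]
    arms.sort(key=lambda b: b.area, reverse=True)   # largest blob first
    parts = {"head": head}
    if arms:
        parts["left_arm"] = arms[0]                 # larger major blob below the head
    if len(arms) > 1:
        parts["right_arm"] = arms[1]                # second major blob below the head
    return parts
```

A real tracker would also carry these labels from frame to frame; this sketch only covers the per-frame assignment.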


Figure  43.  Results  for  detection,  isolation,  and  characterization  of  orientation  of  head  and  arms. 


One  Arm 
Visible 


Arms  Range 
of  Motion 


Figure  44.  Results  for  tracking  of  body  parts  through  sequential  body  parts. 
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Tracking  Head,  Arms,  and  Hands 


Figure  45.  Results  for  tracking  of  body  parts  through  sequential  body  parts  for  YouTube  Video  1. 

This  video  is  the  courtesy  of  YouTube. 


Tracking  Working  Hands 


Figure  46.  Results  for  tracking  of  body  parts  through  sequential  body  parts  for  YouTube  Video  2. 

This  video  is  the  courtesy  of  YouTube. 
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Figure  47.  Results  for  tracking  of  body  parts  through  sequential  body  parts  for  YouTube  Video  2. 

This  video  is  the  courtesy  of  YouTube. 


Figure  48.  Results  for  tracking  of  body  parts  through  sequential  body  parts.  This  video  is  the 

courtesy  of  YouTube. 
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Figure  49.  Results  new  image  processing  algorithm  on  different  YouTube  Videos.  Each  pair  of 
images demonstrates a situation in which tracking of body parts failed. These videos are
courtesy of YouTube.


http://youtu.be/uQCpJEW5eGE

(0:30-0:40)


http://youtu.be/WB6mp3$V Ww


http://youtu.be/iDliUHMci3w

(0:30-2:00)


http://youtu.be/hhmiJzl5caA


http://youtu.be/Lxa6IF-ll3c

(arms/steering wheel)


http://youtu.be/aLqsrb 8G5w  http://youtu.be/xHJY8rs5JaY  http://youtu.be/Ex3nBWI5F w


(short video, sleeping)  (0:26-0:33, sleeping)


http://youtu.be/alOQKikeS6U

Figure 50. A sample of different YouTube videos used for testing and evaluating the newly
developed iVGA algorithms.

http://youtu.be/lXPeioh6tcO


http://www.youtube.com/watch?v=WZifdx5GYoStc&feature=share&list=PLF024002DBB7D5D94

http://youtu.be/HWEbSX8XT0l


http://youtu.be/VR91PNKIXVo

(people smoking, talking)


http://youtu.be/ zSSlSpJFQ

(convertible, man putting on
hats)


http://youtu.be/evL-SiuJ8p4


http://youtu.be/hU4rxRpYPZQ

(side view, door open, long video)


http://youtu.be/4NfMdUs29uw

(3:05-4:45, side view, talking)


http://youtu.be/CVSwl2oATnQ


http://youtu.be/70C8hATI Se

(1:25-2:40, side view, seated)


http://youtu.be/hhCQsVLwREk  http://youtu.be/747vPuH-8QM  http://youtu.be/Lxa6IF-IHc


http://youtu.be/SNb8iifGe6c  http://youtu.be/XoxksSIJOxc  http://youtu.be/Gbl3dJP9bMo

Figure 51. Another sample of different YouTube videos used for testing and evaluating the newly
developed iVGA algorithms.
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6.8  Development  of  A  New  Algorithm  for  Initial  Detection  of  Human  Body  Parts  Based 
on SURF Image Processing Detection - To detect motion, we developed a fast and
robust algorithm based on a technique called SURF. SURF (Speeded Up
Robust Features) is a robust local feature detector first presented by Herbert Bay et al.
in 2006 [3]. It can be used in computer vision tasks such as object recognition or 3D
reconstruction. It is partly inspired by the SIFT descriptor. The standard version of
SURF is several times faster than SIFT and is claimed by its authors to be more robust
against different image transformations than SIFT. SURF is based on sums of 2D Haar
wavelet responses and makes efficient use of integral images. SURF uses an integer
approximation to the determinant-of-Hessian blob detector, which can be computed
extremely quickly with an integral image (3 integer operations). For features, it uses
the sum of the Haar wavelet responses around the point of interest. Again, these can be
computed with the aid of the integral image. A summed area table is a data structure
and algorithm for quickly and efficiently generating the sum of values in a rectangular
subset of a grid. In the image processing domain, it is also known as an integral image.
It was first introduced to computer graphics in 1984 by Frank Crow for use with
mipmaps. In computer vision, it was first prominently used within the Viola-Jones
object detection framework in 2001. Historically, however, this principle is very well
known in the study of multi-dimensional probability distribution functions, namely in
computing 2D (or ND) probabilities (area under the probability distribution) from the
respective cumulative distribution functions [4]. Moreover, the summed area table can
be computed efficiently in a single pass over the image, using the fact that the value in
the summed area table at (x, y) is just:

$$I(x, y) = i(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1)$$

where i(x, y) is the pixel value at (x, y) and I is the summed area table.
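The single-pass recurrence above, and the constant-time rectangle sum it enables, can be sketched as follows. This is a plain-Python illustration with hypothetical function names, not the IRIS implementation; entries outside the image are treated as zero.

```python
def summed_area_table(img):
    """img: list of rows of numbers. Returns I with I[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    sat = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            left = sat[y][x - 1] if x > 0 else 0
            up = sat[y - 1][x] if y > 0 else 0
            diag = sat[y - 1][x - 1] if x > 0 and y > 0 else 0
            # I(x,y) = i(x,y) + I(x-1,y) + I(x,y-1) - I(x-1,y-1)
            sat[y][x] = img[y][x] + left + up - diag
    return sat

def box_sum(sat, x0, y0, x1, y1):
    """Sum of pixels in the inclusive rectangle [x0..x1] x [y0..y1] using four lookups."""
    total = sat[y1][x1]
    if x0 > 0:
        total -= sat[y1][x0 - 1]
    if y0 > 0:
        total -= sat[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += sat[y0 - 1][x0 - 1]
    return total
```

Once the table is built, the cost of a rectangle sum no longer depends on the rectangle size, which is why box-filter responses such as those used by SURF can be evaluated quickly at many scales.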

Figure 52 demonstrates adaptive learning of SURF features in real time. The SURF
algorithm is implemented in the IRIS image processing library (IRIS is a complete and fast
image processing library developed by the PI for robotics image processing applications).
IRIS maintains a pool of 512 SURF features from a typical video stream. From the
first few images of the video, the features reported by SURF are sorted and memorized. Any
subsequent features are then compared in real time against these 512 SURF features.
Features that have the same SURF properties are disregarded, since they can be
assumed to belong to the background. However, new features for which the
algorithm finds no match are considered newer features that may
belong to a foreground object. IRIS uses an exclusive algorithm to discriminate among the
new features and eliminate weak features that are unlikely to represent the
foreground. The newer features found to have significant content belonging to the
foreground are stored in a separate learning stack and used for fast testing against
future video frames. Figure 53 illustrates the results of this latter image
processing technique for separating foreground objects from memorized
background objects.
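The background-memory idea described above can be sketched independently of any particular feature detector. The sketch below assumes each feature arrives as a (strength, descriptor) pair, that "sorted" means sorted by strength, and that two features match when their descriptors lie within a Euclidean distance threshold; all three are assumptions, as are the function names and the threshold value. It does not reproduce the IRIS algorithm.

```python
import math

MEMORY_SIZE = 512     # number of background features kept, as stated in the text
MATCH_DIST = 0.25     # hypothetical matching threshold, not from the report

def learn_background(first_frames, size=MEMORY_SIZE):
    """Memorize the strongest descriptors seen in the first few frames."""
    features = [f for frame in first_frames for f in frame]
    features.sort(key=lambda f: f[0], reverse=True)      # strongest first (assumed)
    return [descriptor for _, descriptor in features[:size]]

def is_known(descriptor, memory, max_dist=MATCH_DIST):
    """True if the descriptor matches any memorized descriptor."""
    return any(math.dist(descriptor, m) <= max_dist for m in memory)

def foreground_candidates(frame, background, min_strength=0.0, max_dist=MATCH_DIST):
    """Keep features that are strong enough and match nothing in the background memory."""
    return [d for s, d in frame
            if s >= min_strength and not is_known(d, background, max_dist)]
```

Candidates that persist over several frames could then be appended to a separate foreground list, corresponding to the learning stack mentioned above.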
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SURF  Techniques  For 
New  Hand  Motions  Detection 


Figure  52.  Results  for  SURF  algorithm  in  IRIS  application  for  detection  of  moving  body  parts. 

SURF  Technique  Used  For 
HAH  Motions  Tracking  and  Learning 


Training  Space  of 


Figure  53.  Results  for  SURF  algorithm  in  IRIS  application  for  detection  of  foreground  objects  from 

background  objects. 


6.9 Development of Another Robust Algorithm for Hand Skin Color Detection Based on the
RGB Mean-Shift Technique - To further boost the performance of skin color detection,
we developed three other hand skin color detectors based on robust clustering techniques.
For this purpose, we collected 255 hand gestures in different configurations, as
illustrated in Figure 54. In addition to these 255 hand pictures, we collected 45 other
pictures showing two close or overlapping hands. The three skin color detectors are
based on a very fast RGB down-sampling technique, a fast RGB mean-shift technique, and
a slow K-means RGB clustering technique. Figures 55, 56, and 57 illustrate the
results of these three techniques, in that order. The
RGB down-sampling technique is fast, yet the background and foreground
are not well separated. The RGB mean-shift technique is not as fast as the RGB
down-sampling technique, yet it yields very respectable results, and segmentation of
foreground and background is less tedious with this approach. The K-means clustering
technique is the slowest of the three, yet it also yields very respectable
results, and foreground and background can be readily separated. The second
approach is the best compromise between performance and speed; therefore,
we chose this technique for RGB clustering of hands against the background and
for isolating them for configurational characterization by the algorithms
that follow.
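To make the contrast between the first two techniques concrete, the sketch below shows a bit-truncation form of RGB down-sampling and a flat-kernel mean shift in RGB space. Both are generic textbook versions written for illustration; the bandwidth, tolerance, merge distance, and function names are assumptions, and the report does not describe its own variants in this detail.

```python
import math

def downsample_rgb(pixel, bits=3):
    """Very fast clustering: keep only the top `bits` bits of each 8-bit channel."""
    shift = 8 - bits
    return tuple((c >> shift) << shift for c in pixel)

def mean_shift(pixels, bandwidth=40.0, max_iter=50, tol=0.5):
    """Flat-kernel mean shift: move each pixel to the mean of its RGB neighbors until stable."""
    modes = []
    for p in pixels:
        m = tuple(float(c) for c in p)
        for _ in range(max_iter):
            near = [q for q in pixels if math.dist(m, q) <= bandwidth]
            if not near:
                break
            new = tuple(sum(q[i] for q in near) / len(near) for i in range(3))
            moved = math.dist(m, new)
            m = new
            if moved < tol:
                break
        modes.append(m)
    return modes

def label_modes(modes, merge_dist=20.0):
    """Merge nearby modes into cluster centers and return (centers, label per pixel)."""
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= merge_dist:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels
```

The down-sampling step costs a few bit operations per pixel but fixes the color bins in advance, whereas mean shift lets the clusters follow the data at the price of a neighborhood search per iteration. This naive version is quadratic in the number of pixels and is meant only to show the idea.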


Hands  Configuration  Space 


■  Gestures  are  a  form  of  nonverbal 
communication  in  which  visible 
bodily  actions  are  used  to 
communicate  important  messages, 
either  in  place  of  speech  or  together 
and  in  parallel  with  spoken 
words.

■  Gestures  are  culture-specific 

■  Convey  very  different  meanings  in 
different  social  or  cultural  settings 

■  Gesture  is  distinct  from  sign 
language 


Figure 54. A Sample Space of Selected Hand(s) Configurations for Skin Color Detection and Gesture
Training/Learning.
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Hand  Skins  Color  Clustering  Via 
RGB  Down-Sampling  Technique 


Original  Hand  Gestures 


Clustered  Hand  Gestures 
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Advantage:  Fast  &  Efficient 

Disadvantage:  Large  Color  Variability 


Figure 55. The result of hand training sample space before (left) and after (right)
skin color detection based on the very fast RGB Down-Sampling Technique.


Hand  Skins  Color  Clustering  Via 
RGB  Mean  Shift  Technique 


Original  Hand  Gestures  Clustered  Hand  Gestures 


Figure 56. The result of hand training sample space before (left) and after (right)
skin color detection based on the fast RGB Mean-Shift Technique.
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Hand  Skins  Color  Clustering  Via 
Kmeans  Technique 


Original  Hand  Gestures 
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Figure  57.  The  result  of  hand  training  sample  space  of  before  [left)  and  after  (right) 
skin  color  detection  based  on  Slow  Kmeans  RGB  pixels  Cluster. 


6.10 Development of Two Competing Techniques for Hand Gesture Classification - After
detection of hand skin, as demonstrated in the previous sections, we developed two
competing techniques for hand gesture classification. The first technique is based on a
Recursive Hamming Neural Network, and the second technique is based on Random
Forest Trees. Initially, we labeled the sample space of all our hand training models. We
divided our entire hand sample space into twelve different classes. Each class contains
a finite number of hand configurations, as illustrated in Figures 58, 59, and 60. The
twelve hand configuration classes are: Touching, Holding, Writing, Opening, Fisting,
Pausing, Gesturing, Cupping, Pinching, Praying, Pointing, and Texting. We note that the
Praying class contains two-hand formations representing the praying hand configuration.
We also note that a single hand in any of the given praying configurations is not
conclusive evidence of praying. To avoid this misclassification, we consider
only two-hand configurations as a state of hand-praying. For the former classifier,
namely the Hamming neural network (HNN), we generated a batch of invariant features of
hand samples for training the network. The HNN determines the class that best
represents a test pattern based on the closeness of two patterns, measured by
Hamming distance. Figure 61 presents the Hamming neural network model. In
information theory, the Hamming distance between two strings of equal length is the number
of positions at which the corresponding symbols are different. Put another way, it measures
the minimum number of substitutions required to change one string into the other, or the
minimum number of errors that could have transformed one string into the other. The
invariant features considered for training the HNN include seven Hu invariant moments,
texture features (variance, energy, entropy, and homogeneity), and three RGB features. For
the second classifier, the coefficients of Zernike polynomials were used as hand features,
and a Random Forest Tree classifier was used to classify the Zernike polynomial
feature vectors. Figure 62 presents the Random Forest Tree classifier operating on
Zernike polynomial coefficients. In mathematics, the Zernike polynomials are a sequence
of polynomials that are orthogonal on the unit disk [5]. There are even and odd Zernike
polynomials. The even ones are defined as

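$$Z_n^m(\rho, \varphi) = R_n^m(\rho) \cos(m \varphi)$$

and the odd ones as

$$Z_n^{-m}(\rho, \varphi) = R_n^m(\rho) \sin(m \varphi),$$

where m and n are nonnegative integers with n ≥ m, φ is the azimuthal angle, ρ is the radial
distance with 0 ≤ ρ ≤ 1, and R_n^m are the radial polynomials defined below. Zernike
polynomials have the property of being limited to a range of -1 to +1, i.e.,
$|Z_n^m(\rho, \varphi)| \le 1$. The radial polynomials are defined as

$$R_n^m(\rho) = \sum_{k=0}^{(n-m)/2} \frac{(-1)^k \, (n-k)!}{k! \left(\frac{n+m}{2}-k\right)! \left(\frac{n-m}{2}-k\right)!} \, \rho^{n-2k}$$

for n - m even, and are identically 0 for n - m odd.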
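The definitions above translate directly into code. The sketch below evaluates the radial polynomial and the even and odd Zernike polynomials at a single point; the function names and the convention of passing a negative m for the odd polynomial are choices made here for illustration. Computing the Zernike coefficients of a hand image would additionally require projecting the image onto these polynomials over the unit disk, which is not shown.

```python
from math import cos, factorial, sin

def radial(n, m, rho):
    """Radial polynomial R_n^m(rho) for integers 0 <= m <= n; zero when n - m is odd."""
    if m < 0 or m > n:
        raise ValueError("requires 0 <= m <= n")
    if (n - m) % 2:
        return 0.0
    total = 0.0
    for k in range((n - m) // 2 + 1):
        coeff = ((-1) ** k * factorial(n - k)
                 / (factorial(k)
                    * factorial((n + m) // 2 - k)
                    * factorial((n - m) // 2 - k)))
        total += coeff * rho ** (n - 2 * k)
    return total

def zernike(n, m, rho, phi):
    """Even polynomial Z_n^m for m >= 0, odd polynomial Z_n^{-|m|} for m < 0."""
    if m >= 0:
        return radial(n, m, rho) * cos(m * phi)
    return radial(n, -m, rho) * sin(-m * phi)
```

For example, radial(2, 0, rho) reduces to 2*rho**2 - 1, and every radial polynomial with n - m even equals 1 at rho = 1.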


Random forests are an ensemble learning method for classification (and regression)
that operate by constructing a multitude of decision trees at training time and
outputting the class that is the mode of the classes output by the individual trees [6]. In
essence, the random forest technique constructs a collection of decision trees with
controlled variance. The training algorithm for random forests applies the general
technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X =
x_1, ..., x_n with responses Y = y_1, ..., y_n, bagging repeatedly selects a bootstrap sample
of the training set and fits trees to these samples:


For b = 1 through B:


1. Sample, with replacement, n training examples from X, Y; call these X_b, Y_b.

2. Train a decision or regression tree f_b on X_b, Y_b.


After training, predictions for unseen samples x' can be made by averaging the
predictions from all the individual regression trees on x':

$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x')$$

or by taking the majority vote in the case of decision trees.

In  the  above  algorithm,  B  is  a  free  parameter.  Typically,  a  few  hundred  to  several 
thousand  trees  are  used,  depending  on  the  size  and  nature  of  the  training  set. 
Increasing  the  number  of  trees  tends  to  decrease  the  variance  of  the  model,  without 
increasing  the  bias.  As  a  result,  the  training  and  test  error  tend  to  level  off  after  some 
number of trees have been fit. An optimal number of trees B can be found using
cross-validation, or by observing the out-of-bag error: the mean prediction error on each
training sample x_i, using only the trees that did not have x_i in their bootstrap sample.
Figure  63  presents  the  performance  evaluation  results  of  these  two  classifiers.  The 
Random  Forest  with  Zernike  Polynomial  Coefficient  feature  vector  training  has  shown 
1.9  percent  better  classification  power  over  the  HNN  with  invariant  hand  gesture 
feature  vectors. 
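The bagging procedure and the out-of-bag error described above can be sketched as follows. To stay short, each "tree" here is a one-split decision stump, which is a stand-in chosen for illustration, and the per-split random feature selection that distinguishes a full random forest from plain bagging is omitted. Function names and the default B are assumptions.

```python
import random
from collections import Counter

def train_stump(X, y):
    """Stand-in for a decision tree: the single (feature, threshold) split with fewest errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            err = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or err < best[0]:
                best = (err, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab

def bagging(X, y, B=25, seed=0):
    """Train B learners, each on a bootstrap sample of size n drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        tree = train_stump([X[i] for i in idx], [y[i] for i in idx])
        trees.append((tree, set(idx)))       # remember which samples were in the bag
    return trees

def predict(trees, row):
    """Majority vote over all learners."""
    return Counter(tree(row) for tree, _ in trees).most_common(1)[0][0]

def oob_error(trees, X, y):
    """Error on each sample using only learners whose bootstrap sample excluded it."""
    wrong = counted = 0
    for i, (row, lab) in enumerate(zip(X, y)):
        votes = Counter(tree(row) for tree, used in trees if i not in used)
        if votes:
            counted += 1
            wrong += votes.most_common(1)[0][0] != lab
    return wrong / counted if counted else float("nan")
```

With feature vectors such as Zernike coefficients in place of the one-dimensional toy inputs, the same voting and out-of-bag bookkeeping apply unchanged.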


Touching 


Writing 


Holding 


Opening 


Figure  58.  Four  hand  configurations:  Touching,  Holding,  Writing,  and  Opening. 
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Figure  59.  Four  hand  configurations:  Fisting,  Pausing,  Gesturing,  and  Cupping. 


Pointing 


Texting 


Figure  60.  Four  hand  configurations:  Pinching,  Praying,  Pointing,  and  Texting. 
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Figure  61.  Hamming  Neural  Network  for  Classification  of  Hand  Samples. 


Figure 62. Random Forest Tree Classifier for Classification of Hand Samples Based on Zernike
Polynomials Coefficients.
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Performance  Comparison  of  Two  Hand-Gesture  Classifiers 
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■  Zernike  Moments  +  Random  Forest  Trees  (ZM-RFT) 

■ Invariant Shape Features + Hamming Neural Network (ISF-HNN)

Average  ZM-RFT  Classification  Performance  =  94.4% 
Average  ISF-HNN  Classification  Performance  =  92.6% 

ZM-RFT  Performance  Gain  Over  ISF-HNN  =  1.9% 


Figure  63.  Result  of  performance  comparison  of  two  hand  gesture  classifiers. 
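For reference, the nearest-pattern rule that underlies the HNN classifier compared in Figure 63 can be sketched as follows. The sketch assumes the feature vectors have already been encoded as equal-length binary patterns; how the report binarizes its Hu-moment, texture, and RGB features is not stated, and the prototype patterns and function names below are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have equal length")
    return sum(x != y for x, y in zip(a, b))

def nearest_class(pattern, prototypes):
    """prototypes: dict mapping class name -> list of stored patterns.
    Returns the class whose closest stored pattern has the smallest Hamming distance."""
    return min(prototypes,
               key=lambda c: min(hamming(pattern, p) for p in prototypes[c]))

# Hypothetical 6-bit prototypes for two of the twelve gesture classes.
PROTOTYPES = {
    "Fisting": [(1, 1, 1, 1, 0, 0)],
    "Pointing": [(1, 0, 0, 0, 0, 1)],
}
```

A Hamming network computes the same winner with a matching layer followed by a competitive (winner-take-all) layer; the explicit minimum above is simply the most direct way to show the decision rule.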


6.11 Development of a Framework for Enhancement of This Feasibility Study Toward
Phase I of This Project - Based on the lessons learned in this feasibility study,
a new architectural framework was developed, as illustrated in Figure 64. This
framework comprises four main components: Visual Processing, Visual Training, a
Gesture Recognition System, and Visual Perceptions Grouping. The Visual Processing
component detects and tracks variations in the imagery data over time. The training
component, on the other hand, provides the trained taxonomy and ontology needed
for recognition of certain known in-vehicle activities. The role of the Gesture Recognition
System component is to classify gestures and discriminate normal gestures from anomalous
ones. Each recognized gesture pattern gives rise to a new order of perception that
needs to be analyzed and modeled to determine whether it is a salient gesture to annotate.
The process begins by detecting salient events; it then becomes a matter of recognizing actions,
where each action is composed of multiple salient events. An array of associated and
correlated actions represents an activity. Therefore, to recognize iVGA, the events
associated with each occupant need to be investigated, analyzed, and scrutinized. The
proposed framework will be able to fulfill this objective. The architecture
recommends a fuzzy-logic context reasoner as an arbitrator of the contextual activities
exhibited by the in-vehicle occupants. Furthermore, this architectural framework
provides a mechanism for annotating newly discovered information. Humans are more
accustomed to reading and understanding situations from written reports.
By continually generating iVGA annotations, one can trace the observed events,
actions, and activities over time and even further interrogate what has happened based
on the significance of the generated annotations. The annotated reports also offer
opportunities for visual data/information analytics and further intelligence processing
and pursuits.
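The event, action, and activity hierarchy described above, together with the annotation step, could be represented along the following lines. This is only a data-structure sketch: the class names, fields, and annotation format are hypothetical, and the fuzzy-logic context reasoner is not modeled.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    occupant: str       # e.g. "driver"
    label: str          # salient event detected by visual processing
    time_s: float       # time of the event in seconds

@dataclass
class Action:
    occupant: str
    label: str
    events: List[Event] = field(default_factory=list)   # an action is built from events

    def span(self):
        """(start, end) time of the action, or (None, None) if it has no events."""
        times = [e.time_s for e in self.events]
        return (min(times), max(times)) if times else (None, None)

@dataclass
class Activity:
    label: str
    actions: List[Action] = field(default_factory=list)  # an activity is built from actions

    def annotate(self) -> List[str]:
        """One human-readable line per action that has at least one event."""
        lines = []
        for a in self.actions:
            start, end = a.span()
            if start is None:
                continue
            lines.append(f"{self.label}: {a.occupant} {a.label} ({start:.1f}s-{end:.1f}s)")
        return lines
```

Keeping the events attached to each action means a generated annotation can always be traced back to the observations that produced it.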
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Figure  64.  A  proposed  Architectural  framework  for  in-Vehicle  Group  Activity  [iVGA]  recognition. 


7.  Scientific  Barriers 

There are a number of scientific barriers complicating the spatiotemporal processing of
partially observable group activities. These barriers are related to: processing of
obscured imagery data representing fragmented gestural/postural configurations,
especially when they occur in tight confined spaces (e.g., inside a parked vehicle);
spatiotemporal understanding of postural gestures (i.e., body, arm, and hand gestures)
in order to predict what activities are taking place in confined spaces despite the presence
of environmental clutter; systematic development of an apt ontology for discrimination of
normal and abnormal behavioral patterns; and spatiotemporal linking of partial
information under task uncertainties. Other scientific barriers are related to sustaining
a continuous sensing-to-perception process in the presence of significant operational noise
factors and inevitable spatiotemporal variances of gestural and postural configurations.


8.  Scientific  Significance 

The scientific significance of this project is related to adaptive learning, incremental
comprehension, constructive perception modeling, and knowledge-based anticipation
modeling of group activities occurring in small spaces. This project establishes a sound
information-theoretic framework for multi-target characterization, data referencing,
opportunistic sensing via robust visual analytics inspired by neuroscience, fuzzy-logic
processing, behavioral pattern learning and modeling, and uncertainty handling via
apt information fusion techniques, with appropriate performance metrics for evaluation
of alternative solutions.


9.  Scientific  Accomplishments 

This project is intended to span multiple phases. This report presents our initial feasibility
study for this challenging project. In this initial feasibility study, a systematic
analysis of target entities operating inside a vehicle, while being observed by a remote
surveillance camera, was performed. This analysis established a set of sound ontologies
for state-space representations of target entities' postural configurations inside a vehicle
as well as their physical limitations, particularly in terms of kinematic constraints that
make certain postural configurations of the arms impossible or unrealistic during normal
activities/operations. To better visualize the admissible and inadmissible postural
configurations, a virtual model of a humanoid was developed using the IRIS software
developed by the PI. The humanoid model is a 32-degree-of-freedom robotic model with
kinematic motions similar to those of humans. This model was utilized to simulate
different human postural configurations in the virtual environment and to develop a suitable
ontology for verification of admissible and inadmissible postural configurations.

Furthermore, we developed suitable image processing (IP) techniques facilitating
automatic background segmentation and detection and tracking of hands, arms, and head.
Several techniques were explored for skin tone detection and segmentation. In occluded
spaces, there is a large variety of shading variation that compromises skin tone
reflections.


10.  Collaboration  and  Leveraged  Funding 

The PI is a Co-PI of an ARO-supported MURI project entitled "Network-based Hard-Soft
Sensor Information Fusion". The MURI project is led by the University of Buffalo. Dr. John
Lavery is the Program Manager of this MURI project. Some funding from the MURI is
partially leveraged to support graduate and undergraduate students currently
working on this project. These students are contributing both to the research
objective of this project and to the research objective of our ongoing MURI
project. The MURI project expires on July 31, 2014.


11.  Technology  Transfer 

This  project  has  held  13  monthly  teleconference  meetings  with  the  ARL  technical 
monitors.  The  ARL  technical  monitors  on  this  project  are  Dr.  Alex  Chan  and  Dr. 
Shuowen  Hu.  This  frequent  interaction  with  ARL  has  been  instrumental  in 
transferring  technology,  and  this  cooperative  technology  transfer  effort  will  be  maintained  to 
ensure  the  successful  achievement  of  the  project's  mission. 


12.  Future  Research  Plan 

In  the  second  phase  of  this  project,  we  plan  to  develop  a  fuzzy-logic-based  inference 
engine  for  inferring  postural  configurations  from  spatiotemporally  processed  visual 
imagery  data.  This  fuzzy  inference  engine  will  enable  fuzzy  reasoning  about  postural 
configurations  under  visual  perception  uncertainties.  In  the  third  phase  of  this  project, 
we  will  develop  the  proposed  multi-layer  Hidden  Markov  Model  for  probabilistic 
state-transition  modeling,  and  hence  group  activity  recognition,  under  partial  visual  observation 
constraints.  In  the  final  stage  of  this  project,  we  will  develop  a  system  for  semantic 
annotation  of  group  activities  occurring  in  confined  spaces. 
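
As a rough illustration of fuzzy reasoning over postural measurements, the sketch below fuzzifies two hypothetical inputs (elbow flexion in degrees and hand height relative to the shoulder, normalized to [-1, 1]) and evaluates two rules with the min operator as fuzzy AND. The input variables, membership breakpoints, rule base, and posture labels are all assumptions made for this example and do not describe the planned inference engine.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def fuzzify(elbow_deg, hand_height):
    """Map crisp measurements to membership degrees (hypothetical breakpoints)."""
    return {
        "elbow_extended": trapezoid(elbow_deg, -1.0, 0.0, 20.0, 60.0),
        "elbow_bent": trapezoid(elbow_deg, 30.0, 80.0, 150.0, 151.0),
        "hand_low": trapezoid(hand_height, -1.01, -1.0, -0.3, 0.2),
        "hand_high": trapezoid(hand_height, -0.2, 0.3, 1.0, 1.01),
    }


def infer_posture(elbow_deg, hand_height):
    """Evaluate two illustrative rules, using min as the fuzzy AND."""
    m = fuzzify(elbow_deg, hand_height)
    return {
        "arm_raised": min(m["elbow_extended"], m["hand_high"]),
        "arm_resting": min(m["elbow_bent"], m["hand_low"]),
    }


if __name__ == "__main__":
    scores = infer_posture(elbow_deg=45.0, hand_height=0.0)
    print(scores, "->", max(scores, key=scores.get))
```

An ambiguous measurement such as the one in the usage example activates both rules partially, which is the behavior that makes fuzzy reasoning attractive under visual perception uncertainty.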


13.  Anticipated  Scientific  Accomplishments 

The  anticipated  scientific  accomplishments  include:  (1)  robust  image  processing  and 
dynamic  tracking  of  body  (i.e.,  head,  arms,  and  hands)  postural/gestural  configurations 
under  occlusion  and  imagery  constraints,  (2)  fuzzy  reasoning  about  spatiotemporal  physical 
postures/gestures  for  discriminating  normal  from  anomalous  behavioral 
patterns,  (3)  dynamic  group  activity  recognition  via  a  probabilistic  multi-layer  Hidden 
Markov  Model,  and  (4)  semantic  annotation  of  group  activities  in  confined  spaces. 
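
To make item (3) concrete, the sketch below runs the standard forward (filtering) recursion of a single-layer Hidden Markov Model and treats an occluded frame as a missing observation, so that only the transition model updates the belief for that frame. It is a simplified stand-in, not the proposed multi-layer model: the states, observation symbols, and all probabilities are invented for illustration.

```python
def forward_filter(observations, states, start_p, trans_p, emit_p):
    """Return P(state_t | observations up to t) for each time step t.

    An observation of None marks an occluded frame: no emission term is
    applied, so the belief is only propagated through the transition model.
    """
    beliefs = []
    prev = None
    for obs in observations:
        unnorm = {}
        for s in states:
            if prev is None:
                prior = start_p[s]
            else:
                prior = sum(prev[r] * trans_p[r][s] for r in states)
            likelihood = 1.0 if obs is None else emit_p[s][obs]
            unnorm[s] = prior * likelihood
        total = sum(unnorm.values())
        if total == 0.0:
            raise ValueError("observation sequence has zero probability")
        prev = {s: p / total for s, p in unnorm.items()}
        beliefs.append(prev)
    return beliefs


# Hypothetical two-state model (all numbers are illustrative assumptions).
STATES = ("normal", "anomalous")
START_P = {"normal": 0.9, "anomalous": 0.1}
TRANS_P = {
    "normal": {"normal": 0.9, "anomalous": 0.1},
    "anomalous": {"normal": 0.2, "anomalous": 0.8},
}
EMIT_P = {
    "normal": {"hands_on_wheel": 0.8, "reaching_back": 0.2},
    "anomalous": {"hands_on_wheel": 0.3, "reaching_back": 0.7},
}

if __name__ == "__main__":
    frames = ["hands_on_wheel", None, "reaching_back"]  # middle frame occluded
    for t, belief in enumerate(forward_filter(frames, STATES, START_P, TRANS_P, EMIT_P)):
        print(t, belief)
```

A multi-layer variant would typically feed the inferred individual-level states of each occupant into a higher-level model of the group activity; that coupling is outside the scope of this sketch.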


14.  Student  Involvement 

This  project  has  involved  a  number  of  high-school,  undergraduate,  Master's,  and  Ph.D. 
students,  including: 

•  Vinayak  Elangovan  (Ph.D.)  -  Part  time,  Spring  2014 

•  Azin  Poshtkar  (Master's)  -  Full  time,  Spring  and  Summer  2014 

•  Ayele  Tegegne  (UG)  -  Part  time,  Summer  2014,  a  Navy  veteran 

•  David  A.  Potter  Jr.  (UG)  -  Part  time,  Summer  2014,  an  Army  veteran 

•  Pedro  Henrique  Tavares  (UG)  -  Part  time,  Summer  2014 

•  Brent  Warner  (UG)  -  Part  time,  Summer  2014,  an  Army  helicopter  veteran 

•  Ramon  Gonzalez  (UG)  -  Part  time,  Summer  2013,  graduated  in  Spring  2014 

•  Daniel  Allen  (UG)  -  Part  time,  Summer  2013,  graduated  in  Spring  2014 


High  School  Summer  Interns 

•  Jalyn  Edmundson  (Lighthouse  Christian  School  High-School)  -  Part  time,  Summer  2013 

•  Freeman  Johnson  (Martin  Luther  King  Magnet  High-School)  -  Part  time,  Summer  2013 

•  Zheer  Ahmed  (Martin  Luther  King  Magnet  High-School)  -  Part  time,  Summer  2013 


15.  Conclusion 

This  project  presents  a  technical  approach  for  detection  and  recognition  of  in-Vehicle 
Group  Activities  (iVGA)  in  confined,  obstructed  spaces.  Robust  recognition  of  such 
activities  can  significantly  benefit  homeland  security  as  well  as  automated  battlefield 
surveillance  and  reconnaissance  operations.  In  particular,  this  project  establishes  a 
technical  bridge  toward  achieving  this  goal. 